08. Dataviz: Boxplot and Scatterplot

Preparation

Read 2.1.1 (scatter plots) and 2.1.5 (box plots)

Goals

Read scatter plots and box plots correctly.

Identify when each plot type is appropriate.

Generate your own plots from data using seaborn.

Learn the five-number summary and compare it to a box plot.

Note on Accessibility:

Data visualization is the production of charts and graphs that reveal trends and interrelationships to the eye. Although it can be done well by and for low-vision and colorblind users, dataviz is fundamentally visual, so it’s problematic for the blind. Data methods by and for the blind should strongly consider other techniques. Nevertheless, it’s important for both blind and sighted students of statistics to understand the larger discipline. I’ve tried to make these notes useful to everyone, understanding that a histogram might not be.

The Scatter Plot

A scatterplot is a visualization of two quantitative variables, appropriate when at least two numerical values are recorded for each data case. The \(x\)-axis spans the range of one variable, and the \(y\)-axis spans the range of the other. Each data case is then drawn as one point \((x,y)\), located by its values of the two variables. A cloud of points arises, and its shape is a clue to any association between the variables.

The Scatter Plot – Example

import pandas as pd
import seaborn as sns
df = pd.read_csv("county.csv")
sns.scatterplot(data=df, x='homeownership', y='multi_unit')
A scatterplot. The x-axis is labeled homeownership percent by county and ranges from 0 to 100. The y-axis is labeled multiunit dwelling percent by county and also ranges from 0 to 100. Hundreds of points inhabit the frame, most clustered in the bottom right (high homeownership, low multi-unit). A strong downward-sloping trend is visible, suggesting a negative correlation.
Figure 1: Scatterplot comparing homeownership to multi-unit dwelling by U.S. county

The Scatter Plot – interpretation

sns.scatterplot(data=df, x='homeownership', y='multi_unit')
  1. What are typical, common values for homeownership and multi-unit dwelling? Why?
  2. Where would you find St. Lawrence County on the scatterplot? Where would you find Manhattan (“New York County”)?
  3. What sort of county is represented at the bottom right? What sort of counties are the outliers at the top?
  4. Is there an upward or downward trend? Speculate why.
A scatterplot. The x-axis is labeled homeownership percent by county and ranges from 0 to 100. The y-axis is labeled multiunit dwelling percent by county and also ranges from 0 to 100. Hundreds of points inhabit the frame, most clustered in the bottom right (high homeownership, low multi-unit). A strong downward-sloping trend is visible, suggesting a negative correlation.
Figure 2: Scatterplot comparing homeownership to multi-unit dwelling by U.S. county

Scatterplot – example with hue

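iris = sns.load_dataset('iris')  # one way to get Fisher's iris data: seaborn's downloadable sample copy
sns.lmplot(data=iris, x='petal_width', y='sepal_width', hue='species')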
  1. Of the three, which species is easiest to identify? How is it recognized?
  2. What’s the best way to distinguish between the other two species?
  3. Do flowers with wider petals usually have wider sepals too?
  4. There are fewer dots on this scatterplot. What does that mean about the flower data?
A scatterplot using Fisher's Iris data. The x-axis is labeled Petal width and ranges from 0 to 2.5. The y-axis is labeled Sepal width and ranges from 2 to 4.5. 150 points are shown in three main clusters, each with an upward trend line. Each cluster is color coded and labeled with a species name: setosa, versicolor, or virginica.
Figure 3: Scatterplot of petal width versus sepal width for Fisher’s irises.

What’s a Box (-and-whisker) Plot?

  • A visualization of the distribution of one quantitative variable; an alternative to a histogram.
  • Reveals the full range of data values along the \(x\)-axis.
  • Divides the data range into four “equal” parts, called quartiles.
    • The quartiles usually have unequal widths, but
    • each quartile’s width is set so that it contains a quarter of the data points.
    • The inner two quartiles are drawn as a box, and the outer two are drawn as “whiskers.”
    • The five quartile boundary points are called Min, 25%, 50% (the median), 75%, and Max. Together they form the five-number summary.
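
The five boundary points can be computed directly, which makes it easy to compare a five-number summary with a boxplot. The following is a minimal sketch, not the only way to do it. The nine values are made up for illustration (they are not the county data), and the variable names are hypothetical.

```python
import pandas as pd

# Made-up poverty percentages for nine hypothetical counties (not real data).
poverty = pd.Series([3, 7, 8, 12, 13, 14, 18, 21, 50])

# Quantiles 0, 0.25, 0.5, 0.75, and 1 are Min, 25%, 50%, 75%, and Max.
five_number = poverty.quantile([0, 0.25, 0.5, 0.75, 1])
print(five_number)

# describe() reports the same five values, along with count, mean, and std.
print(poverty.describe())
```

For these made-up values the summary is 3, 8, 13, 18, 50. Each of the four intervals between neighboring boundaries holds about a quarter of the points, even though the last interval (18 to 50) is far wider than the others.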

Box (-and-whisker) plot – Basic Example

sns.boxplot(data=df, x='poverty')
  1. Based on the boxplot, what is a typical poverty percent for U.S. counties?
  2. For what x-regions are the data points tightly clustered? Where are they more thinly spread?
A boxplot for poverty percent by county. The x-axis is labeled poverty and ranges from 0 to 50. The y-axis is unlabeled and unmarked. The boxplot has a central box from x=12 to x=20, divided by a centerline at x=16. The whiskers extend left to x=3 and right to x=32. The five-number summary is 3, 12, 16, 20, 50.
Figure 4: Boxplot for poverty percent by county.

Whisker Technicalities

Customarily, whiskers aren’t allowed to be more than 1.5 times as long as the box. If a boxplot would otherwise be drawn with longer whiskers, each whisker is trimmed to end at the most extreme data point within 1.5 * [box size] of the box, and data beyond this length are represented as individual dots. Both Python and OpenIntro do this. You need to know this to answer questions like “what’s the maximum data value?” using a boxplot: the maximum may be the outermost dot rather than the end of the whisker.
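
The whisker rule can be worked through by hand. The sketch below assumes the common 1.5 * [box size] convention described above, where the box size is the distance from the 25% point to the 75% point. The data are made up for illustration, and the variable names are hypothetical.

```python
import pandas as pd

# Made-up poverty percentages for nine hypothetical counties (not real data).
poverty = pd.Series([3, 7, 8, 12, 13, 14, 18, 21, 50])

q1 = poverty.quantile(0.25)   # left edge of the box
q3 = poverty.quantile(0.75)   # right edge of the box
box_size = q3 - q1            # also called the interquartile range (IQR)

# A whisker may reach at most 1.5 box sizes beyond each edge of the box.
lower_limit = q1 - 1.5 * box_size
upper_limit = q3 + 1.5 * box_size

# Each whisker ends at the most extreme data point inside the limits.
inside = poverty[(poverty >= lower_limit) & (poverty <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the limits is drawn as an individual dot.
dots = poverty[(poverty < lower_limit) | (poverty > upper_limit)]

print("whiskers end at", whisker_low, "and", whisker_high)
print("individual dots:", list(dots))
```

Here the box runs from 8 to 18, so the limits are -7 and 33. The right whisker stops at 21, and the value 50 appears as a single dot. The maximum data value is therefore 50, not 21.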

Box (-and-whisker) plot – Rich Example

Since a boxplot is so narrow, several can be stacked together to compare related distributions across categories:

sns.boxplot(data=df, x='Attack', y='Type')
An array of stacked boxplots. The x-axis is labeled 'Attack' and ranges from 0 to 200. The y-axis is labeled 'Type' and labeled with pokemon categories fire, water, grass, dragon, etc. A variety of box plots stretch horizontally, one on top of the other, one for each pokemon type.
Figure 5: Boxplot for Attack value of various pokemon, separated by type.

The Box Plot – more serious example

sns.boxplot(data=df, x='salary', y='team')
  1. The plots seem left-justified. What does that mean about salaries?
  2. Which teams pay the least? the most?
  3. What do the circles mean?
  4. Which team pays the highest median wage?
  5. Professor Howald is considering quitting math and playing major league ball. What’s a realistic salary expectation?
A Boxplot for MLB salary by team. The x-axis is labeled salary ($M) and ranges from 0 to 30. The y-axis is labeled Team. Many box plots are stacked together, displaying the distribution for each team.
Figure 6: Boxplot for MLB salary by team.

Plotting yourself

Once a dataframe is loaded, it’s not hard to make a scatterplot or boxplot:

import pandas as pd    # Needed once, not for every plot
import seaborn as sns  # Needed once, not for every plot
df = pd.read_csv("filename.csv")  # Replace with the name of your data file
# The column names below must match columns in the file you loaded.
sns.boxplot(data=df, x='Attack', y='Type')
sns.scatterplot(data=df, x='homeownership', y='multi_unit')

See examples illustrated on previous slides.
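
For practice without a data file, here is a self-contained sketch that builds a tiny dataframe and draws both plot types side by side. The six rows are made up, the "region" column and the output filename are hypothetical, and the "Agg" line is only there so the script also runs where no display is available (it is not needed in a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values for six hypothetical counties (not real data).
df = pd.DataFrame({
    "homeownership": [55, 62, 68, 71, 75, 80],
    "multi_unit":    [40, 30, 22, 15, 12, 6],
    "region":        ["urban", "urban", "urban", "rural", "rural", "rural"],
})

# Two panels: a scatterplot on the left, stacked boxplots on the right.
fig, (left, right) = plt.subplots(1, 2, figsize=(10, 4))

# Scatterplot: two quantitative variables, one point per county.
sns.scatterplot(data=df, x="homeownership", y="multi_unit", ax=left)

# Boxplots: one quantitative variable, one box per category.
sns.boxplot(data=df, x="homeownership", y="region", ax=right)

fig.savefig("example_plots.png")  # hypothetical output filename
```

The same pattern works with a dataframe loaded from a CSV file: only the column names passed to x and y need to change.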

Summary: Which plot type is best for each case?

  1. To illustrate how time spent studying relates to course grade.
  2. To show any relationship between religious identity and GPA.
  3. To illustrate a link between height and weight.
  4. To show the distribution of molecule sizes in a polymer.
  5. To illustrate total sales by product type.
  6. To visualize the masses and temperatures of thousands of stars.
  7. To show whether pokemon with higher attack also have higher defense.

Putting it all together

Load the OpenIntro Run17 data. Make your own plots to answer each question.

  1. What genders are represented and how many of each?
  2. How are the run times distributed?
  3. Are the run times unimodal? Bimodal? Something else? Why!?
  4. Can you compare run time distribution…
    • by event?
    • by gender?
    • by both?
  5. How are age and run time related?
    • Can you compare by gender? By event?

Individual work

On Webwork, called “Dataviz Box and Scatter.”